Python 演算法 Day 10 - Feature Selection

自學筆記

s790502ss 2021-11-08 21:41:05 ‧ 3888 瀏覽

分享至

Chap.II Machine Learning 機器學習

https://yourfreetemplates.com/free-machine-learning-diagram/

Part 2. Feature Engineering 特徵工程

在開始跑演算法前，會藉由特徵工程提高準確率、優化收斂速度。
常用的特徵工程為以下三種：

2-1. Feature Scaling 特徵縮放：

將 scale 縮放，達到方便辨識。（常用: 'Normalization' & 'Stadardization）

2-2. Feature Selection 特徵選擇：

將與 y 強相關的 x 選擇出來，透過減少互相干擾、或預測能力差的 x 變數，達到加快演算。
常見方法有 SBS、Random Forest...等。
EX. 鐵達尼號中，將 'alone' （與 'sibsp', 'parch' 干擾且重複）刪除。

2-3. Feature Extraction 特徵萃取：

將與 y 強相關的 x 選擇出來，透過揉合數個 x 變數（將數個變數揉合為單個），達到加快演算。
常見方法有 PCA、TSVD、T-SNE...等。
EX. 鐵達尼號中，將 'sibsp', 'parch' 揉合成 'family_size'。

特徵工程的 2&3 又稱 Dimensionality Reduction 降維，好處為：

A. 精度改進。
B. 過擬合風險降低。
C. 加快訓練。
D. 改進的數據可視化。
E. 增加模型的可解釋性。

剛剛優點中的名詞，＂Overfitting 過度擬合＂是甚麼?

Overfitting 過度擬合：模型受到訓練資料影響過大，使其預測測試資料時效果不佳。
Underfitting 低度擬合：模型對資料的描述能力太差，無法正確解釋資料。

至於造成過擬合的原因，要從偏差或變異說起。

什麼是偏差（Bias）？什麼是變異（Variance）？

Bias 偏差：指的是預測值與實際值的差距。（打靶打得準）
Variance 變異：指預測值的離散程度。（打靶打得精）

理論上，我們會希望把 Model 訓練的＂既準又精＂，使它可直接描述數據背後的真實規律、意義。
以便後續用它來執行一些描述性或預測性的任務。

然而，實作上就有以下：
1.隨機誤差（Random error）
2.偏差（Bias error）
3.方差（Variance error）。

隨機誤差源於數據本身，基本無法消除。
而 Bias 與 Variance，又跟 Overfitting & Underfitting 的問題息息相關。

那麼，把 Bias error 跟 Variance error 都降到最低就好了嗎？

理論上，若有＂無窮的數據＂+＂完美的模型＂+＂究極運算能力＂，是可以達成的！
實際上，我們的數據跟計算能力都很有限，且模型也不可能完美。

打個比方：建模過程中，若想把 Bias error 降到最低，則須建出非常複雜的模型。
等於讓模型把訓練資料特徵全部硬背，連同隨機誤差也全擬合進模型，使模型失去了泛化能力。
這樣的結果，就稱＂Overfitting 過度擬合＂。
一旦過擬合，對於未知的資料預測的能力就會很差，造成高 Variance error。

*模型的複雜度與模型預測的誤差

為了避免過擬合，在訓練模型時，會將資料集拆分成 training & testing（training 中再拆分 validation）。
再透過調整超參數（Hyperparameter）來改變模型，以適配不同的資料。

但現在，還是先回到特徵工程上。

PS. 因特徵縮放在 Day13 已經有稍微提過，以下會著重在特徵選擇 & 特徵萃取。

2-2. Feature Selection 特徵選擇

特徵選擇上，有幾種方式可幫助我們判斷/選取，以下提到 SBS & RandomForest。

A. Sequential Backward Selection (SBS) 循序向後選擇

以鳶尾花作為例子，見以下

# 1. Datasets
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

ds=load_wine()

X=ds.data
y=ds.target

X.shape, y.shape
>>  ((178, 13), (178,))


# 4. Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)


# 5. Learning Algorithm
# 6. Traing Model
# 7. Score Model
from sklearn.metrics import accuracy_score

def calc_score(X_train, y_train, X_test, y_test, indices):
    
    # Choose Regression
    LR = LogisticRegression()
    print(indices, X_train.shape)
    
    # Fit model
    LR.fit(X_train[:, indices], y_train)
    y_pred = LR.predict(X_test[:, indices])
    
    # Score model
    score = accuracy_score(y_test, y_pred)
    
    return score

接著運用迴圈迭代各個排列組合，計算跑分：

from itertools import combinations
import numpy as np

score_list = []
combin_list = []

best_score_list=[]

# 外迴圈：dim = 1~13
for dim in range(1, X.shape[1]+1):
    score_list = []
    combin_list = []
    
    # all_dim = (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
    all_dim = tuple(range(X.shape[1]))

    # 內迴圈：C 13 取 n，n 從 1~13
    for c in combinations(all_dim, r=dim):
        score = calc_score(X_train, y_train, X_test, y_test, c)

        # 分數加入 score_list，跑合加入 combin_list
        score_list.append(score)
        combin_list.append(c)

    # 找出最高分的項次
    best_loc = np.argmax(score_list)
    
    best_score = score_list[best_loc]
    best_combin = combin_list[best_loc]
    print(best_loc, best_combin, best_score)
    
    # 把所有結果最好的丟進 list
    best_score_list.append(best_score)

>>  6 (6,) 0.8539325842696629
>>  5 (0, 6) 0.9325842696629213
>>  278 (8, 9, 12) 0.9662921348314607
>>  65 (0, 2, 4, 6) 0.9662921348314607
>>  120 (0, 1, 5, 8, 9) 0.9662921348314607
>>  71 (0, 1, 2, 5, 7, 9) 0.9662921348314607
>>  59 (0, 1, 2, 3, 6, 9, 11) 0.9775280898876404
>>  66 (0, 1, 2, 3, 5, 6, 9, 11) 0.9775280898876404
>>  107 (0, 1, 2, 3, 6, 7, 8, 9, 12) 0.9775280898876404
>>  232 (1, 2, 3, 4, 5, 6, 8, 9, 11, 12) 0.9775280898876404
>>  68 (1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12) 0.9775280898876404
>>  7 (0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12) 0.9662921348314607
>>  0 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) 0.9438202247191011

將 best_score_list 視覺化：

import matplotlib.pyplot as plt

No = np.arange(1, len(best_score_list)+1)
plt.plot(No, best_score_list, marker='o', markersize=6)

從圖中可知，選 7 項變數（97.7%）來演算結果，與選 11 項（97.7%）相近，
且變數變少，大幅提升運算效率。
當然，若再進一步想增加運算效率，也可選用 3 項變數（96.6%）。

B. Random Forest Classifier 隨機森林演算法

以紅酒分類作為例子，見以下

載入必要套件 & Datasets

import numpy as np
import pandas as pd

df_wine = pd.read_csv('https://archive.ics.uci.edu/'
                      'ml/machine-learning-databases/wine/wine.data',
                      header=None)

# if the Wine dataset is temporarily unavailable from the
# UCI machine learning repository, un-comment the following line
# of code to load the dataset from a local path:

# df_wine = pd.read_csv('wine.data', header=None)


df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
                   'Proline']

print('Class labels', np.unique(df_wine['Class label']))
df_wine.head()

Split Data

from sklearn.model_selection import train_test_split

# 'Class label' 是 Y
X, y = df_wine.drop('Class label', axis=1), df_wine[['Class label']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

運用迴圈迭代各個排列組合，並計算他們的跑分：

from sklearn.ensemble import RandomForestClassifier

# 載入 wine 的 columns
wine_col = df_wine.columns[1:]

# 隨機森林演算法
forest = RandomForestClassifier(n_estimators=500, random_state=1)
forest.fit(X_train, y_train)

# 把每一個變數特徵的重要性列出，從大排到小
ipt = forest.feature_importances_
ipt_sort = np.argsort(ipt)[::-1]

# 依序迭代出重要特徵
for f in range(X_train.shape[1]):
    print(f"{f+1:>2d}) {wine_col[ipt_sort[f]]:<30s} {ipt[ipt_sort[f]]:.4f}")

視覺化：

import matplotlib.pyplot as plt

plt.title('Feature Importance')
plt.bar(range(X_train.shape[1]), ipt[ipt_sort], align='center')

# 以 wine_col 代換掉 x 軸的 0~12
plt.xticks(range(X_train.shape[1]), wine_col[ipt_sort], rotation=90)

# 把圖上下縮短
plt.tight_layout()

plt.show()

柏拉圖式（QC 七工具）

又稱主次因素分析法，是一種條形圖和折線圖的組合，為品質管理上經常使用的一種圖表方法。
其好處是，可以設定一個目標（比方說 80%），將影響最大的幾個因子挑出。

var_exp = ipt[ipt_sort]

# 把 ipt 裡的機率逐個加總（最後肯定會是 1）
cum_var_exp = np.cumsum(var_exp)

>>  array([0.20302504, 0.17278228, 0.12686498, 0.12430788, 0.10764943,
        0.0748521 , 0.05569083, 0.04471882, 0.02379331, 0.02336044,
        0.02253831, 0.01137369, 0.0090429 ])

作圖

# Pareto Chart
import matplotlib.pyplot as plt

# 劃出 bar 條
plt.bar(range(1, 14), var_exp, alpha=0.5, label='individual explained variance') #  , align='center'

# 劃出 上升階梯
plt.step(range(1, 14), cum_var_exp, where='mid', label='cumulative explained variance')

plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')

plt.legend(loc='best')
plt.tight_layout()
plt.axhline(0.9, color='r', linestyle='--', linewidth=1)

plt.show()

從圖中可得需要選取至少 8 項特徵，方可包含 90% 影響因子。

當然還有其他方法可以達到特徵選取，可以參考。

到這裡，就完成了特徵選擇的實作！

結論：

特徵選取擁有數種方法，每種都有其優勢。須根據不同場合及資料類型選用。
但後續的特徵萃取（又稱降維），較能有效加速演算及減少變異偏差。
.
.
.
.
.

Homework Answer：

請參考鐵達尼號的流程，使用鑽石清理資料來完成演算法。

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv('diamond.csv')
df.head()

df.info()

# 確認 NaN
df.isna().sum()

>>  Carat Weight    0
    Cut             2
    Color           2
    Clarity         1
    Polish          2
    Symmetry        2
    Report          2
    Price           0
    dtype: int64

# 使用前一筆填補 NaN
df = df.fillna(method='ffill')
df.isna().sum()

>>  Carat Weight    0
    Cut             0
    Color           0
    Clarity         0
    Polish          0
    Symmetry        0
    Report          0
    Price           0
    dtype: int64

# 印出每個欄位種類個數
for x in df.columns[1:-1]:
    print(x)
    print(df[x].value_counts())
    print()

>>  Cut
    Ideal              2483
    Very Good          2426
    Good                708
    Signature-Ideal     253
    Fair                129
    VeryGood              1
    Name: Cut, dtype: int64

    ...(中間略)

    Report
    GIA     5265
    AGSL     735
    Name: Report, dtype: int64

# 將明顯是 'Very Good' 但填錯的 'VeryGood' 取代掉
df['Cut'] = df['Cut'].str.replace('VeryGood', 'Very Good')
df['Cut'].value_counts()

plt.figure(figsize=(14, 8))
plt.subplot(2, 3, 1)

# enumerate(): 把 (項次, 內容) 迭代出來，丟進 i 與 x
# 畫出數量圖
for i, x in enumerate(df.columns[1:-1]):
    plt.subplot(2, 3, i+1)
    sns.countplot(x=x, data=df)

# 劃出 'Carat Weight' 克拉重
# sns.distplot(df['Carat Weight'])

sns.distplot(np.log(df['Carat Weight']))

# 'Carat Weight' 無異狀
sns.boxplot(df['Carat Weight'])

# 'Price' 發現有離群點
sns.boxplot(df['Price'])

# 把 <= 平均價格+2*價格標準差 以外的異常點排除
df = df[df['Price']<=df['Price'].mean()+2*df['Price'].std()]
sns.boxplot(df['Price'])

餘下的部分就選一個演算法進行跑分即可~
.
.
.
.
.

Homework：

試著用 sklearn 的資料集 breast_cancer，操作 Featuring Selection (by RandomForest)。

直播研討會

{{ item.channelVendor }} {{ item.webinarstarted }} |

直播中

尚未有邦友留言

立即登入留言

參賽組數

1064 組

團體組數

40 組

累計文章數

22207 篇

完賽人數

600 人

15th鐵人賽 16th鐵人賽 13th鐵人賽 14th鐵人賽 12th鐵人賽 11th鐵人賽鐵人賽 2019鐵人賽 javascript 2018鐵人賽 python 2017鐵人賽 windows php c# windows server linux css react vue.js

IT邦幫忙